Introduction

Our team sought to answer an intriguing question: can we predict the political party of Twitter users from the words they tweet? After some discussion, we narrowed this question down to inference on current members of Congress. To this end, we used the Twitter API to gather the past year’s tweets from all Senate and House members. Taking a random sample of tweets, we distilled this huge mine of information into word densities for each user: how often each user used each word, expressed as a ratio to the user who used that word most often.

With this data we performed unsupervised learning techniques like principal component analysis (PCA) and clustering, as well as supervised techniques like logistic regression and random forests. Our aim throughout was inference: by building models with improved predictive ability, we can gain sharper insights into the structure of the data and the language associated with each political party.

Note: in general, the data transforms take a while to run, so we pre-load the transformed data and only run the code that is strictly necessary.
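The pre-loading follows the usual compute-once, cache, and re-load pattern. Here is a minimal, self-contained sketch of the idea (the file name and the toy transform are placeholders, not our actual pipeline, which saves and loads .RData files):

```r
# Illustrative caching pattern: compute an expensive transform once,
# save the result, and re-load it on later runs. The toy transform
# below stands in for the slow JSON parsing and text tidying.
cache_file <- tempfile(fileext = ".rds")

expensive_transform <- function() {
  data.frame(x = 1:3)  # placeholder for the real pipeline's output
}

if (file.exists(cache_file)) {
  full_data <- readRDS(cache_file)
} else {
  full_data <- expensive_transform()
  saveRDS(full_data, cache_file)
}
```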

The Data

All files referenced in this section are in the DataCollection folder. Our sources for data collection are the two files representatives.txt and senators.txt, taken from the GWU Libraries Dataverse. These files contain the last 3,200 tweets from every member of the 115th Congress (the current session), excepting four members of the House who don’t have official Twitter accounts: Collin Peterson (D-MN-07), Lacy Clay (D-MO-01), Madeline Bordallo (Guam delegate), and Gregorio Sablan (Northern Mariana Islands delegate). Each of these files is a list of tweet IDs, which uniquely identify tweet objects in the Twitter API. Metadata about how user accounts were identified is stored in the corresponding README files. Using the script get_twitter_data.py, we pulled down a random sample of 10,001 tweets from the House of Representatives (10001_house.zip) and 50,000 tweets from the Senate (50000_senate.zip).
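The sampling step amounts to drawing a simple random sample of IDs from those files before hydrating them through the API. A hedged sketch in R (the real pull happens in get_twitter_data.py, and the toy IDs below stand in for readLines("senators.txt")):

```r
# Sketch of the sampling step: each source file contains one tweet ID
# per line; we draw a simple random sample of IDs to hydrate via the
# Twitter API. Fabricated IDs stand in for the real file here.
set.seed(1)
tweet_ids <- as.character(sample.int(1e9, 5000))  # stand-in for the ID file
senate_sample <- sample(tweet_ids, size = 500)    # IDs to hydrate
```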

Our second data set is legislators-current.csv, which contains (among other variables) the following information on all current members of Congress: name, state, chamber (House or Senate), district (if House), party, website, and social media account names. We use this data set to identify the political party of each Twitter account in the tweet data. Because this file comes from a different source than our tweet data, and some politicians use multiple Twitter accounts (for example, @POTUS versus @realDonaldTrump), some manual cleaning was needed to make sure every account in the tweet data is present in the congress data. In the script add_congress_data.R, we “fill in” this information, which mostly amounted to fixing capitalization differences.

Now that the two data sets match completely on Twitter username, we can transform the data into the form we want. The json_to_df.R script takes in the tweets as JSON files, extracts the information we’re interested in from each tweet, and builds a dataframe from it. Each row of this dataframe is a tweet, and the columns are variables like tweet ID, timestamp, text, and author. The tidy_text.R script parses the content of the tweets, counts the occurrences of each word by user, scales each row and column, and then joins the result with the congress_df data set to make full_data.RData. Each row of this data set is a user, each column is a word, and the entries are scaled proportions of how often a user used each word. For ease of computation, only words used by at least 10 distinct users were kept.
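A minimal sketch of that word-density transform on toy counts (the column names here are illustrative, not those of the real scripts): each user’s counts become proportions of that user’s total words, and each word is then scaled by its heaviest user so entries lie in (0, 1].

```r
# Toy long-format word counts: one row per (user, word) pair.
counts <- data.frame(
  user = c("a", "a", "b", "b"),
  word = c("tax", "senate", "tax", "health"),
  n    = c(4, 1, 2, 3)
)
# Proportion of each user's total words.
counts$prop   <- ave(counts$n, counts$user, FUN = function(x) x / sum(x))
# Scale each word by the user who used it most, so the heaviest user gets 1.
counts$scaled <- ave(counts$prop, counts$word, FUN = function(x) x / max(x))
```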

Exploratory Data Analysis

In the file make_plots.R, we plot some basic results of the data.

The top plot shows how often members of each party use each word, on a log scale. For example, Republicans use the word “senate” about 0.6% of the time, while Democrats use it about 0.4% of the time. The red line marks equal usage by the two parties. The bottom plot shows the log odds ratio log(Democrat usage/Republican usage) for the 15 words each party used most relative to the other. Not all words can be shown in the first plot, so let’s break this up into a few categories.
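The statistic in the bottom plot can be sketched in a couple of lines; the usage proportions below are made up for illustration.

```r
# log(Democrat usage / Republican usage): positive values mean a word is
# used relatively more by Democrats, negative values by Republicans.
dem_usage <- c(senate = 0.004, trumpcare = 0.002)    # toy proportions
rep_usage <- c(senate = 0.006, trumpcare = 0.0002)
log_ratio <- log(dem_usage / rep_usage)
```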

While some of these make intuitive sense (more Democrats tag other Democrats, and vice versa), one interesting note is that Democrats tag both @housegop and @senatedems more, and Republicans are more likely to tag @foxnews, @foxbusiness, and @aipac (the American Israel Public Affairs Committee, a pro-Israel lobbying group).

In the use of hashtags, we see some direct opposites between the two parties: #obamacare vs. #trumpcare, #passthebill vs. #killthebill (regarding the tax reform bill), and #marchforlife and #pro_life vs. #istandwithpp. Other perceived talking points of the two parties emerge: the Iran nuclear deal and the Keystone XL pipeline for Republicans, and climate change and the Trump-Russia investigation for Democrats.

The “regular words” (neither hashtags nor tagged users) include a few more potentially uninteresting words (such as “morning”), but a few patterns still emerge in the plots.

Modeling

PCA

It turns out that visualizing a data set with 4345 variables is tricky, to say the least. To get around this, we applied PCA to see which directions actually account for variation in the data.

d <- full_data[,-(1:2)]
pca1 <- prcomp(d)
pc_df <- data.frame(PC = 1:20,
                    PVE = pca1$sdev[1:20]^2 / sum(pca1$sdev^2))  # PVE relative to total variance
ggplot(pc_df, aes(x = PC, y = PVE)) +
  geom_line() + 
  geom_point()

From this scree plot, we can see that the first 3 PCs account for the vast majority of the structure in the data.

scores_df <- data.frame(user = full_data$twitter,
                         party = full_data$party_id,
                         PC1 = pca1$x[,1],
                         PC2 = pca1$x[,2],
                         PC3 = pca1$x[,3],
                         PC4 = pca1$x[,4]) %>%
  left_join(congress_df, by = c("user" = "twitter"))

loading_df <- data.frame(word = colnames(d), pca1$rotation[ ,1:4])

ggplot(scores_df, aes(x = PC1, y = party)) + geom_jitter()

The first principal component does a pretty good job of encoding what party the user belongs to, Democrat (-) or Republican (+).

ggplot(scores_df, aes(x = PC2, y = chamber_type)) + geom_jitter()

The second principal component appears to distinguish Representatives (+) from Senators (-).

kable(arrange(loading_df, desc(PC3))[c(1:10,4336:4345), ])
word PC1 PC2 PC3 PC4
1 families -0.1239068 0.0117368 0.2008988 0.0796565
2 #trumpcare -0.3533959 0.0060057 0.1374581 0.0205737
3 #paymoreforless -0.1578449 0.0385303 0.1311770 0.0086349
4 seniors -0.0913230 0.0124293 0.1084742 0.0452491
5 #veteransday 0.0270323 -0.0755258 0.0879955 -0.1819736
6 hurt -0.0749704 0.0211459 0.0856073 0.0204101
7 coverage -0.1592510 0.0102826 0.0851376 0.0453163
8 advances 0.0275588 -0.0726092 0.0801361 -0.1647117
9 tie 0.0242893 -0.0590068 0.0733977 -0.1526220
10 #trumpcares -0.0537211 0.0136856 0.0710925 0.0265834
4336 #trumprussia -0.0376904 -0.0177762 -0.0960597 -0.0442235
4337 investigate -0.0281191 -0.0817810 -0.0988678 0.0606355
4338 credible -0.0300072 -0.0098416 -0.1010107 -0.0293986
4339 chairman -0.0100436 -0.0009404 -0.1021721 -0.0521272
4340 conduct -0.0272213 -0.0198882 -0.1037505 -0.0148771
4341 investigation -0.0397529 -0.0429024 -0.1378704 -0.0358674
4342 independent -0.1316147 -0.0430895 -0.1391207 -0.0441689
4343 committee -0.0335960 -0.0242492 -0.1434073 -0.0752751
4344 russia -0.0367527 -0.0614352 -0.1638510 -0.0174157
4345 nunes -0.0746828 -0.0205165 -0.1785077 -0.0933248

The third principal component weights users differently based on whether they talk more about health care (+) or the Russia investigation (-).

We expected the first PC to encode party, and we spent a while trying to figure out what the second could be (though in hindsight it makes sense that chamber shows up). The third, however, was the most surprising.

Below we’ve plotted a summary of the first two components, along with the non-text variables we think they best encode.

ggplot(scores_df, aes(x = PC1, y = PC2, color = party, shape = chamber_type)) +
  geom_point() +
  scale_color_manual(values = c("#619CFF", "#00BA38", "#F8766D")) +
  scale_shape_manual(values = c(1, 16))

Clustering

# use multiple starts for every k > 1 so the scree plot is comparable
km1 <- kmeans(d, centers = 1)
km2 <- kmeans(d, centers = 2, iter.max = 10, nstart = 20)
km3 <- kmeans(d, centers = 3, iter.max = 10, nstart = 20)
km4 <- kmeans(d, centers = 4, iter.max = 10, nstart = 20)
km5 <- kmeans(d, centers = 5, iter.max = 10, nstart = 20)
km6 <- kmeans(d, centers = 6, iter.max = 10, nstart = 20)
km7 <- kmeans(d, centers = 7, iter.max = 10, nstart = 20)

bub <- data.frame(ClusterNumber = 1:7,
                  tot.within.ss = c(km1$tot.withinss,
                                    km2$tot.withinss,
                                    km3$tot.withinss,
                                    km4$tot.withinss,
                                    km5$tot.withinss,
                                    km6$tot.withinss,
                                    km7$tot.withinss
                                    ))
ggplot(bub, aes(x = ClusterNumber, y = tot.within.ss)) +
  geom_line() +
  geom_point()

This is a scree plot of total within-cluster sum of squares against the number of clusters. No clear elbow exists in the plot, implying that there is no strong clustering in the data. Below we plot some of these clusterings on the first two principal components.

cluster_df <- data.frame(party = scores_df$party,
                         chamber = scores_df$chamber_type,
                         PC1 = pca1$x[,1],
                         PC2 = pca1$x[,2],
                         k2 = km2$cluster, k3 = km3$cluster, k4 = km4$cluster)
ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k2), shape = party)) +
  geom_point() +
  scale_shape_manual(values = c(1, 17, 16))

ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k3))) + geom_point()

ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k4))) + geom_point()

In this analysis, 2 clusters separate the parties; 3 clusters group the entire Senate together and split the House by party; and 4 clusters adds a mysterious fourth group (sometimes it splits the Senate by party, and sometimes it sprinkles group 4 throughout; the result is highly variable). We can see how well the 2-clustering matches party:

conf <- table(cluster_df$k2, cluster_df$party)
kable(conf)
          Democrat Independent Republican
Cluster 1       57           1        269
Cluster 2      178           1          0

If we treat it as a “classification model”, the 2-clustering has an MCR of 0.1146245. Overall, the clustering agrees with our PCA in that the most identifiable feature is party, followed by chamber.
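The MCR here comes from treating the clustering as a classifier. A small sketch of that computation (our exact handling of the two Independents may differ, so the toy figure below need not match the number above):

```r
# Treat a clustering as a classifier by assigning each cluster its
# majority party, then compute the misclassification rate from the
# cluster-by-party contingency table (rows = clusters, columns = parties).
cluster_mcr <- function(conf) {
  1 - sum(apply(conf, 1, max)) / sum(conf)
}
conf_toy <- matrix(c(57, 178, 1, 1, 269, 0), nrow = 2,
                   dimnames = list(c("Cluster 1", "Cluster 2"),
                                   c("Democrat", "Independent", "Republican")))
cluster_mcr(conf_toy)
```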

Naive Model

naive_mcr <- mean(scores_df$party != "Republican")
kable(scores_df %>% group_by(party) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)))  # avoid hard-coding the 506 total
party n prop
Democrat 235 0.4644269
Independent 2 0.0039526
Republican 269 0.5316206

Our most naive model is simply the mode: predict that every politician in the data set is a Republican. This gives a misclassification rate of 0.4683794. Any model that improves on this (not a hard task) will give us more insight into the data.

Logistic Models

Because (with two exceptions) we are classifying users into two parties, logistic regression makes sense. To fit the models, we remove the two Independent senators (Bernie Sanders of Vermont and Angus King of Maine) from the data set. Because of the exceedingly large number of predictors, restricted models fit with the ridge or lasso penalties are appealing. We use 5-fold cross-validation to choose the penalty and prevent overfitting.

no_ind <- filter(full_data, party_id != "Independent") %>%
  mutate(party_id = factor(as.character(party_id)))

logit_ridge <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 0)
ridge_grid <- exp(seq(0, 5, length.out = 50))
ridge_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 0, nfolds = 5,
                      type.measure = "class", lambda = ridge_grid)
ridge_bestlam <- ridge_cv$lambda.min
ridge_pred <- predict(logit_ridge, s = ridge_bestlam,
                      newx = data.matrix(full_data[ ,-(1:2)]), type = "class")
ridge_mcr <- mean(ridge_pred != full_data$party_id)

logit_lasso <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 1)
lasso_grid <- exp(seq(-6, -2, length.out = 50))
lasso_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 1, nfolds = 5,
                      type.measure = "class", lambda = lasso_grid)
lasso_bestlam <- lasso_cv$lambda.min
lasso_pred <- predict(logit_lasso, s = lasso_bestlam,
                      newx = data.matrix(full_data[ ,-(1:2)]), type = "class")
lasso_mcr <- mean(lasso_pred != full_data$party_id)

plot(ridge_cv)

plot(lasso_cv)

Because both penalized models perform better than \(\lambda = 0\) (ordinary logistic regression), we can feel confident choosing one of them over the full logistic model. On the full data set, our ridge MCR is 0.0513834 and our lasso MCR is 0.0039526. That the lasso does better makes intuitive sense, as we would expect many words to be meaningless for prediction. We can examine which words received non-zero coefficients in the lasso model:

kable(data.frame(word = colnames(full_data)[-(1:2)],
                 coeff = as.vector(predict(logit_lasso, s = lasso_bestlam,
                                           type = "coefficients"))[-1]) %>%
  filter(coeff != 0) %>%
  arrange(desc(coeff)))
word coeff
forward 0.2468251
obama 0.2424155
#obamacare 0.0975044
energy 0.0105299
aerial -0.0011912
#findyourpark -0.0033991
@senatortester -0.0046506
requiring -0.0070628
nutrition -0.0116992
deserve -0.0217419
corporations -0.0373750
sj -0.0504059
bill -0.0535677
views -0.0701874
critical -0.0724234
massive -0.0736773
bipartisan -0.0749062
ties -0.2192227
hate -0.2305272
predatory -0.2339835
environment -0.2378165
tear -0.2475724
protect -0.2776202
stem -0.2820718
facebook -0.2861099
recreation -0.3404947
civil -0.3439464
demands -0.3607009
background -0.3865282
#usa -0.4315452
transparent -0.4375030
deal -0.4503333
prioritize -0.4951557
@sencortezmasto -0.5136071
farming -0.5522052
million -0.5619746
default -0.5872521
@timkaine -0.6079599
recuse -0.6398649
discrimination -0.6629120
sad -0.6640538
afford -0.6725954
people -0.6852473
#paymoreforless -0.7016892
@housegop -0.7207613
bannon -0.7362035
medicare -0.7452185
@gop -0.7672465
services -0.7827530
@senrobportman -0.7942401
trumpcare -0.7955435
backwards -0.7991641
shutdown -0.8336448
tomorrows -0.8363293
nunes -0.8386618
constitution -0.8393652
aca -0.8592723
acres -0.9188310
environmental -0.9333859
resignation -0.9387459
sens -0.9510586
pulling -1.0048228
scott -1.0074662
coverage -1.0179053
@senfranken -1.0572592
@epa -1.0688921
average -1.0715685
fortune -1.1175015
robotics -1.1712294
foreign -1.2789617
voices -1.2987092
gop -1.4175229
homes -1.4678952
interference -1.5706041
extreme -1.5916222
dont -1.6495215
#actonclimate -1.6495964
independent -1.7330026
transgender -1.7427259
task -1.7563759
tuned -1.7807651
americans -1.8535850
voting -1.9207576
internet -1.9470149
blame -2.0011513
#equalpayday -2.0954791
pruitts -2.1094533
hour -2.1412538
fargo -2.1976930
trumps -2.3150519
seniors -2.3616638
package -2.4792467
partisan -2.5739030
cut -2.8286217
dem -3.0239072
#aca -3.1591187
gops -3.2006990
oppose -3.3608927
pay -3.4491225
overdose -3.4600498
base -3.7206483
trump -4.3218156
unacceptable -4.3539389
#broadbandprivacy -4.7144419
#trumpcare -7.0142632

In this list we can see some of the same topics that we found through PCA, like health care and the Russia investigation, as well as net neutrality and the EPA/climate change. Judging by the magnitude of the coefficients on each side of zero, words typically used by Democrats (as identified in our exploratory data analysis) were more important in deciding which party a user belonged to.

We can also check which users were misclassified by the lasso and ridge models.

# the prediction matrix's column ends up named "X1" in the data frame
lasso_missed <- data.frame(twitter = full_data$twitter,
                           state = scores_df$state,
                           party = full_data$party_id,
                           pred = lasso_pred,
                           prob = as.vector(predict(logit_lasso, s = lasso_bestlam,
                                                    newx = data.matrix(full_data[ ,-(1:2)]),
                                                    type = "response")),
                           stringsAsFactors = FALSE) %>%
  filter(party != X1) %>%
  arrange(desc(prob))
kable(lasso_missed)
twitter state party X1 prob
SenAngusKing ME Independent Republican 0.8649459
SenSanders VT Independent Democrat 0.1097487
ridge_missed <- data.frame(twitter = full_data$twitter,
                           state = scores_df$state,
                           party = full_data$party_id,
                           pred = ridge_pred,
                           prob = as.vector(predict(logit_ridge, s = ridge_bestlam,
                                                    newx = data.matrix(full_data[ ,-(1:2)]),
                                                    type = "response")),
                           stringsAsFactors = FALSE) %>%
  filter(party != X1) %>%
  arrange(desc(prob))
kable(ridge_missed)
twitter state party X1 prob
repdavidscott GA Democrat Republican 0.5778697
SenAngusKing ME Independent Republican 0.5683941
RepGonzalez TX Democrat Republican 0.5645812
RepAlGreen TX Democrat Republican 0.5611652
RepSinema AZ Democrat Republican 0.5514223
AnthonyBrownMD4 MD Democrat Republican 0.5391477
RepBetoORourke TX Democrat Republican 0.5370081
RepJimCosta CA Democrat Republican 0.5358524
RepDerekKilmer WA Democrat Republican 0.5331092
Sen_JoeManchin WV Democrat Republican 0.5258514
RepOHalleran AZ Democrat Republican 0.5200528
SenDonnelly IN Democrat Republican 0.5192061
SenatorTester MT Democrat Republican 0.5190576
RepStephMurphy FL Democrat Republican 0.5179823
MarkWarner VA Democrat Republican 0.5151879
RepJoshG NJ Democrat Republican 0.5150400
SenatorHeitkamp ND Democrat Republican 0.5142368
RepTomSuozzi NY Democrat Republican 0.5104539
SenBillNelson FL Democrat Republican 0.5091237
RepBobbyRush IL Democrat Republican 0.5070259
RepLipinski IL Democrat Republican 0.5065274
RepJoseSerrano NY Democrat Republican 0.5013563
McCaskillOffice MO Democrat Republican 0.5009417
SenStabenow MI Democrat Republican 0.5007496
teammoulton MA Democrat Republican 0.5000421
SenSanders VT Independent Democrat 0.2931826

Independents Angus King and Bernie Sanders both caucus with the Democrats, so we can consider Senator Sanders’ classification correct. Of note is that both of our logistic models only misclassified Democrats as Republicans! In addition, many of these congresspeople are Democratic legislators from majority-Republican states like West Virginia, Texas, and Georgia.

Plotting these missed users among all the points, we see that most of them are Democrats grouped in the Republican cloud to the right. This seems to agree with our earlier statement that PC1 encodes party.
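A base-graphics sketch of the plot described above, on toy PC scores; in the actual analysis, scores_df and the tables of missed users from the earlier chunks would supply the data.

```r
# Highlight misclassified users on the first two principal components.
# Toy scores and a hypothetical missed user stand in for the real data.
scores_toy <- data.frame(
  user  = c("a", "b", "c", "d"),
  PC1   = c(-2, -1.5, 1, 2),
  PC2   = c(0.1, -0.2, 0.3, -0.1),
  party = c("Democrat", "Democrat", "Republican", "Republican")
)
missed_users <- c("b")                      # users the models got wrong
scores_toy$missed <- scores_toy$user %in% missed_users
plot(scores_toy$PC1, scores_toy$PC2,
     pch = ifelse(scores_toy$missed, 17, 1),  # filled = misclassified
     col = ifelse(scores_toy$party == "Democrat", "blue", "red"),
     xlab = "PC1", ylab = "PC2")
```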

Boosted Tree

binary <- mutate(no_ind, party_id = ifelse(party_id == "Democrat", 0, 1))
boost_tweet <- gbm(party_id ~ .-twitter, data = binary,
                 n.trees = 1000,
                 shrinkage = 0.03)
## Distribution not specified, assuming bernoulli ...
boost_pred <- predict(boost_tweet,
                      newdata = full_data,
                      n.trees = 1000,
                      type = "response") > .5
boost_pred <- ifelse(boost_pred, "Republican", "Democrat")
boost_mcr <- mean(boost_pred != full_data$party_id)

kable(head(summary(boost_tweet),20))
var rel.inf
#trumpcare #trumpcare 45.9002656
#aca #aca 5.0380427
coverage coverage 4.0169797
seniors seniors 1.9648526
trump trump 1.7280736
protect protect 1.6795244
obamacare obamacare 1.6716458
people people 1.6479331
aca aca 1.5418171
million million 1.4770039
bill bill 1.4473962
voting voting 1.3089384
ties ties 1.2633074
gop gop 1.2021974
unacceptable unacceptable 1.1685083
#broadbandprivacy #broadbandprivacy 0.9834594
dont dont 0.9585029
oppose oppose 0.8583855
aisle aisle 0.8447608
introduced introduced 0.8283877

Our boosted tree’s MCR is 0.013834. In the boosted tree, #trumpcare really stands out in variable importance, along with familiar themes like health care and net neutrality.

boost_missed <- data.frame(twitter = full_data$twitter,
                           state = scores_df$state,
                           party = full_data$party_id,
                           pred = boost_pred,
                           prob = as.vector(predict(boost_tweet,
                                                    newdata = full_data,
                                                    n.trees = 1000,
                                                    type = "response")),
                           stringsAsFactors = FALSE) %>%
  filter(party != pred) %>%
  arrange(desc(prob))
kable(boost_missed)
twitter state party pred prob
SenAngusKing ME Independent Republican 0.8874329
RepGonzalez TX Democrat Republican 0.7523050
AnthonyBrownMD4 MD Democrat Republican 0.6321009
RepAlGreen TX Democrat Republican 0.5391994
RepJimCosta CA Democrat Republican 0.5356138
repdavidscott GA Democrat Republican 0.5346318
SenSanders VT Independent Democrat 0.2503776

We see that many of the same congresspeople get misclassified in the boosted tree as in the logistic models.

Random Forest

We attempted regular bagging as well (equivalent to a random forest with mtry set to the number of predictors), but it proved computationally infeasible with this many predictors.

tweet_rf <- randomForest(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, importance = TRUE)
rf_pred <- predict(tweet_rf, newdata = data.matrix(full_data[ ,-(1:2)]), type = "response")
rf_mcr <- mean(rf_pred != as.character(full_data$party_id))

Our random forest’s misclassification rate is 0.0059289. This is slightly higher than the lasso’s, but still a good bit better than the ridge model’s.

importance <- data.frame(tweet_rf$importance) %>%
  rownames_to_column() %>%
  arrange(desc(MeanDecreaseAccuracy))
kable(head(importance, 20))
rowname Democrat Republican MeanDecreaseAccuracy MeanDecreaseGini
#trumpcare 0.0159837 0.0583884 0.0385616 11.769370
#paymoreforless 0.0031985 0.0230981 0.0137832 4.151707
#broadbandprivacy 0.0023760 0.0183534 0.0109121 4.098199
oppose 0.0019396 0.0182084 0.0105483 3.563669
#aca 0.0016644 0.0168051 0.0097134 3.475906
coverage 0.0029010 0.0118868 0.0076073 4.186002
million 0.0013977 0.0125869 0.0073688 3.910890
republicans 0.0017508 0.0122367 0.0073023 3.316491
lose 0.0006299 0.0112252 0.0063364 3.047679
#protectourcare 0.0013749 0.0103399 0.0062596 2.083547
independent 0.0011110 0.0100612 0.0059082 2.588138
americans 0.0040660 0.0069378 0.0055721 3.271237
voting 0.0010512 0.0089394 0.0052977 2.864924
trump 0.0021525 0.0074223 0.0049156 2.634625
cuts 0.0009349 0.0081037 0.0046771 2.003964
trumps 0.0005685 0.0082164 0.0046750 2.485621
aca 0.0011696 0.0075797 0.0045718 2.589078
@housegop -0.0005506 0.0081371 0.0040981 2.866806
seniors 0.0012471 0.0058954 0.0037588 1.492273
nunes 0.0008528 0.0061879 0.0037032 1.713277

Many of the same words from earlier appear to have high variable importance in the random forest we fit. The most important words here also correspond to words that are most often used by Democrats, which is interesting.

rf_missed <- data.frame(twitter = full_data$twitter,
                        state = scores_df$state,
                        party = full_data$party_id,
                        pred = rf_pred,
                        prob = predict(tweet_rf,
                                       newdata = data.matrix(full_data[ ,-(1:2)]),
                                       type = "prob")[ ,2],
                        stringsAsFactors = FALSE) %>%
  filter(party != as.character(pred)) %>%
  arrange(desc(prob))
kable(rf_missed)
twitter state party pred prob
SenAngusKing ME Independent Republican 0.762
RepGonzalez TX Democrat Republican 0.584
SenSanders VT Independent Democrat 0.126

In addition to the two Independents, the forest misclassified Rep. Vicente González of Texas’ 15th Congressional District.

Discussion

To recap, our models’ misclassification rates were: naive model, 0.4683794; 2-means clustering, 0.1146245; ridge logistic regression, 0.0513834; boosted tree, 0.013834; random forest, 0.0059289; and the lasso, 0.0039526.

Our first 3 PCs encoded party, chamber, and whether a user talked more about health care or the Russia investigation, respectively. Our clustering grouped first by party, then by chamber.

We can think about which members of Congress our non-naive models found it harder to classify.

missed <- data.frame(missed = unique(c(ridge_missed$twitter, lasso_missed$twitter,
                                       boost_missed$twitter, rf_missed$twitter))) %>%
  left_join(congress_df, by = c("missed" = "twitter"))
kable(missed)
missed last_name first_name chamber_type state party_id
repdavidscott Scott David rep GA Democrat
SenAngusKing King Angus sen ME Independent
RepGonzalez Gonzalez Vicente rep TX Democrat
RepAlGreen Green Al rep TX Democrat
RepSinema Sinema Kyrsten rep AZ Democrat
AnthonyBrownMD4 Brown Anthony rep MD Democrat
RepBetoORourke O’Rourke Beto rep TX Democrat
RepJimCosta Costa Jim rep CA Democrat
RepDerekKilmer Kilmer Derek rep WA Democrat
Sen_JoeManchin Manchin Joe sen WV Democrat
RepOHalleran O’Halleran Tom rep AZ Democrat
SenDonnelly Donnelly Joe sen IN Democrat
SenatorTester Tester Jon sen MT Democrat
RepStephMurphy Murphy Stephanie rep FL Democrat
MarkWarner Warner Mark sen VA Democrat
RepJoshG Gottheimer Josh rep NJ Democrat
SenatorHeitkamp Heitkamp Heidi sen ND Democrat
RepTomSuozzi Suozzi Thomas rep NY Democrat
SenBillNelson Nelson Bill sen FL Democrat
RepBobbyRush Rush Bobby rep IL Democrat
RepLipinski Lipinski Daniel rep IL Democrat
RepJoseSerrano Serrano José rep NY Democrat
McCaskillOffice McCaskill Claire sen MO Democrat
SenStabenow Stabenow Debbie sen MI Democrat
teammoulton Moulton Seth rep MA Democrat
SenSanders Sanders Bernard sen VT Independent

Again, we ignore Senator Sanders’ misclassification because he is considered farther to the left than the rest of the Democratic Party and caucuses with the Democrats. Many of the misclassified members belong to the Blue Dog Coalition, a House caucus of “fiscally-responsible Democrats” who are traditionally more conservative than the party as a whole.

blue_dogs <- c("Costa", "Cuellar", "Lipinski", "Bishop", "Cooper", "Correa", "Crist", "Gonzalez", "Gottheimer", "Murphy", "O’Halleran", "Peterson", "Schneider", "Schrader", "Scott", "Sinema", "Thompson", "Vela")
kable(filter(missed, !(last_name %in% c(blue_dogs, "Sanders"))))
missed last_name first_name chamber_type state party_id
SenAngusKing King Angus sen ME Independent
RepAlGreen Green Al rep TX Democrat
AnthonyBrownMD4 Brown Anthony rep MD Democrat
RepBetoORourke O’Rourke Beto rep TX Democrat
RepDerekKilmer Kilmer Derek rep WA Democrat
Sen_JoeManchin Manchin Joe sen WV Democrat
SenDonnelly Donnelly Joe sen IN Democrat
SenatorTester Tester Jon sen MT Democrat
MarkWarner Warner Mark sen VA Democrat
SenatorHeitkamp Heitkamp Heidi sen ND Democrat
RepTomSuozzi Suozzi Thomas rep NY Democrat
SenBillNelson Nelson Bill sen FL Democrat
RepBobbyRush Rush Bobby rep IL Democrat
RepJoseSerrano Serrano José rep NY Democrat
McCaskillOffice McCaskill Claire sen MO Democrat
SenStabenow Stabenow Debbie sen MI Democrat
teammoulton Moulton Seth rep MA Democrat

Of the remaining members, many come from rural, Southern, or typically Republican states, and our models may have picked up on topics in their tweets that line up more closely with Republicans’. One thing of note is that every model misclassified Maine Senator Angus King as a Republican, even though he is an Independent who caucuses with the Democrats. Maine has a history of strong independent candidates, and King is a former Democrat who left the party before running for governor (a race in which he defeated Susan Collins, the other current senator from Maine). Upon leaving the party, King stated that “The Democratic Party as an institution has become too much the party that is looking for something from government,” suggesting he holds some views sympathetic to Republicans (or at least dissimilar to Democrats).

In terms of variable importance, our models noted many of the same words that we saw in our exploratory data analysis. #trumpcare was almost always the most important variable, and important topics included health care, the Russia investigation, and net neutrality. Most of the words considered “important” were words used more often by Democrats, which is interesting. One reason for this might be that Democrats occupied a wider range of scores on PC1, indicating that their tweets were more dissimilar and thus harder to classify.

Ideas for Further Analysis

While our research question focused on inference, we originally set out to build a predictive model. Because politicians’ official Twitter accounts use such different words than “regular people”, however, we would have had to either gather party-labeled tweets from ordinary users to train on, or build a model that only predicts the party of politicians.

We found logistical and ethical issues with the first option, and saw little point in the second, since almost all politicians list their party openly on their Twitter accounts. The second approach could perhaps be used to predict how a nominally “non-partisan” elected official would actually act in practice, though that would likely require different data.

Because we were only interested in inference, we did not worry about overfitting or model validation as much as we would have for predictive models. To verify our inferences, another data set could be created to test our models; however, although such a test set would certainly contain different tweets from our training data, it would contain tweets from the same people, so the two data sets would not be independent. Because our data set had a relatively low ratio of observations to predictors (less than 1/4), we decided to use all the data we had collected rather than just a random sample of users.

References

Littman, Justin, 2017. “115th U.S. Congress Tweet Ids”, Harvard Dataverse, V1, http://dx.doi.org/10.7910/DVN/UIVHQR.

Repository “congress-legislators” in GitHub group “unitedstates”. https://theunitedstates.io/congress-legislators/legislators-current.csv.